Many complex systems have been shown to share universal properties of organization, such as scale independence,
modularity and self-similarity. We borrow tools from statistical physics in order to study structural
preferential attachment (SPA), a recently proposed growth principle for the emergence of the aforementioned 
properties. We study the corresponding stochastic process in terms of its time evolution, its asymptotic behavior 
and the scaling properties of its statistical steady state. Moreover, approximations are introduced to facilitate the 
modelling of real systems, mainly complex networks, using SPA. Finally, we investigate a particular behavior 
observed in the stochastic process, the peloton dynamics, and show how it predicts some features of real growing 
systems using prose samples as an example. 
I. INTRODUCTION

In a recent contribution, we proposed a model of network
organization [1] based on a generalization of the classical
preferential attachment principle (PA) [2, 3] to a higher
order: structural preferential attachment (SPA). In this model,
elements of the system join and create structures. In all
attachment events, both the element and the structure involved
are chosen proportionally to their past activities. Elements can
represent money being invested, written words, individuals in
a social network, proteins or websites, while the structures
can be business firms, semantic fields, friendships and
communities, protein complexes or types of activities and interest.

SPA can be described by the following stochastic process
(see Fig. 1 for a visual aid). At every time step, an element
joins a structure. With probability q, the element is a new
one; with probability 1 - q, it is chosen among existing
elements proportionally to the current number of structures
to which they belong (i.e., their membership number). Moreover,
with probability p, the structure is a new one of size s;
with probability 1 - p, it is chosen among existing structures
proportionally to the current number of elements they possess
(i.e., their size). Whenever the structure is a new one, the
remaining s - 1 elements involved in its creation are once again
preferentially chosen among existing nodes. The basic structure
size s is called the system base and refers to the smallest
structural unit of the system. For example, if s = 1, the system
base is simply the elements themselves and we refer to
this version as node-based SPA, while if s = 2, the system
base is a pair of elements, resulting in link-based SPA.
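
The process described above can be simulated directly. The following is a minimal sketch rather than the implementation used for the figures: the function name `simulate_spa`, the slot lists used for preferential sampling and the `seed` argument are assumptions made for illustration. It starts from a single structure containing a single element and does not forbid an element from joining the same structure twice.

```python
import random


def simulate_spa(q, p, s=1, steps=10_000, seed=0):
    """Grow one SPA system and return (memberships, sizes).

    memberships[i] is the number of structures element i belongs to,
    sizes[j] is the number of elements in structure j.
    """
    rng = random.Random(seed)
    memberships = [1]      # one initial element ...
    sizes = [1]            # ... inside one initial structure
    # One entry per membership: a uniform draw from these lists is a
    # draw proportional to membership number or to structure size.
    element_slots = [0]
    structure_slots = [0]

    def join(element, structure):
        memberships[element] += 1
        sizes[structure] += 1
        element_slots.append(element)
        structure_slots.append(structure)

    for _ in range(steps):
        # Element: new with probability q, otherwise preferential choice.
        if rng.random() < q:
            memberships.append(0)
            element = len(memberships) - 1
        else:
            element = rng.choice(element_slots)
        # Structure: new with probability p, otherwise preferential choice.
        if rng.random() < p:
            sizes.append(0)
            structure = len(sizes) - 1
            join(element, structure)
            # Assumption: the remaining s - 1 members are drawn from all
            # current memberships, including the one just created.
            for _ in range(s - 1):
                join(rng.choice(element_slots), structure)
        else:
            join(element, rng.choice(structure_slots))
    return memberships, sizes
```

Histograms of the two returned lists give one realization of the membership and size distributions; averaging over many seeds gives Monte Carlo estimates of their means.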

This stochastic process can either be seen as a scheme of 
throwing balls (the elements) in bins (the structures) or as a 
process of network growth. In the latter, the elements are the 
nodes of the network while the structures represent significant 
topological patterns, motifs, modules or communities, within 
which elements are linked. 

SPA results in the growth of modular systems, because
modules (or structures) are the basic building blocks of the
model. These systems are also scale-free, in the sense that
their main statistical features (membership and size
distributions) converge toward power laws (free of any
characteristic scale) as a result of the preferential attachment
principle [2, 3]. Finally, these systems are said to be self-similar
as different levels of organization follow the same general
behavior: elements are interconnected with one another by sharing
structures in the same way the structures themselves are
interconnected by sharing elements.

In this paper, we borrow tools from statistical physics to
study SPA in detail. In Sec. II, an exact description of SPA is
obtained by writing the corresponding discrete stochastic
process. From this description, we obtain the statistical steady
state of the resulting system with asymptotic expressions for
its scaling behaviors. In Sec. III, some useful approximations
are introduced and studied in order to facilitate the comparison
between systems produced by SPA and real-world systems,
using the cond-mat arXiv co-author network as an example.
In order to investigate the validity of these approximations,
we then study the existence of correlations between elements
and structures, in both the SPA process and the cond-mat
arXiv. Lastly, in Sec. IV, we highlight an interesting behavior
of discrete PA processes, which we call the peloton dynamics,
by comparing the initial stochastic process with an
explicit solution for the time evolution of the continuous time
version (further details are presented in Appendices A and
B). We then seek empirical evidence of this behavior in growing
prose samples. A conclusion summarizes our results.

FIG. 1: (color online). A step of node-based SPA.



II. STOCHASTIC PROCESS



A. Time evolution 

To follow the growth of a system as prescribed by the SPA
process, we separate elements and structures. We distinguish
nodes by their respective number of memberships, m, and
structures by their respective size, n, as these are the only
features relevant to their evolution. Let $\tilde{N}_m(t)$ be the mean
number of elements (or nodes, to use the network terminology)
with m memberships and $\tilde{S}_n(t)$ be the mean number of
structures of size n. Throughout the paper, tildes are used in
quantities describing absolute numbers. Also note that as we
follow the mean distribution of these quantities, we restrict
ourselves to a deterministic approximation of the process.

At each time step, the evolution of these quantities is
twofold: first, a constant increment for potential new nodes
and structures; second, an operation corresponding to the
preferential growth of existing nodes and structures. More clearly,
each time step corresponds to an iteration of the following
rule:

$$ \tilde{N}_m(t+1) = \tilde{N}_m(t) + q\,\delta_{m,1} + \frac{1 - q + p(s-1)}{t\,[1 + p(s-1)]} \left[ (m-1)\,\tilde{N}_{m-1}(t) - m\,\tilde{N}_m(t) \right] \tag{1} $$

$$ \tilde{S}_n(t+1) = \tilde{S}_n(t) + p\,\delta_{n,s} + \frac{1 - p}{t\,[1 + p(s-1)]} \left[ (n-1)\,\tilde{S}_{n-1}(t) - n\,\tilde{S}_n(t) \right] . \tag{2} $$

The two increments $q\,\delta_{m,1}$ and $p\,\delta_{n,s}$, where $\delta_{i,j}$ is the Kronecker
delta, correspond to birth events for elements (with one
membership) and structures (of size s), respectively. The last
increments correspond to the growth of old entities, where a
compartment has a negative effect on itself and a positive
effect on its neighboring compartment (e.g., $\tilde{N}_m \to \tilde{N}_{m+1}$) at a
given rate, and the denominator $t\,[1 + p(s-1)]$ normalizes the
preferential attachment probabilities.
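
Eqs. (1) and (2) can be iterated numerically as written. The sketch below is an illustration under stated assumptions, not the authors' integration code: the function name `iterate_mean_field` is hypothetical, and the iteration is started at t = 1 with one element of membership one and one structure of size one, so that the denominator of the growth term is never zero.

```python
def iterate_mean_field(q, p, s=1, t_max=100):
    """Iterate Eqs. (1)-(2) from t = 1 up to t = t_max.

    Returns (N, S) where N[m] is the mean number of elements with m
    memberships and S[n] the mean number of structures of size n.
    """
    length = t_max + s + 1          # large enough that nothing is truncated
    N = [0.0] * length
    S = [0.0] * length
    N[1] = 1.0                      # assumed initial state: one element ...
    S[1] = 1.0                      # ... in one structure of size one
    norm = 1.0 + p * (s - 1)
    for t in range(1, t_max):
        rate_n = (1.0 - q + p * (s - 1)) / (t * norm)
        rate_s = (1.0 - p) / (t * norm)
        new_N = N[:]
        new_S = S[:]
        for k in range(1, length):
            # Inflow from compartment k - 1, outflow from compartment k.
            new_N[k] += rate_n * ((k - 1) * N[k - 1] - k * N[k])
            new_S[k] += rate_s * ((k - 1) * S[k - 1] - k * S[k])
        new_N[1] += q               # birth of an element with one membership
        new_S[s] += p               # birth of a structure of size s
        N, S = new_N, new_S
    return N, S
```

For s = 1 the growth terms only move entities between compartments, so the total number of elements grows by q per step and the total number of memberships by one per step, which gives a quick consistency check on any implementation.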

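This iterative description is straightforward, yet we can
define the system in closed form by using generating functions
(GFs) [6]. We define two functions whose power series
coefficients correspond to the elements of our two ensembles:

$$ \tilde{N}(x;t) = \sum_m \tilde{N}_m(t)\,x^m \quad \text{and} \quad \tilde{S}(x;t) = \sum_n \tilde{S}_n(t)\,x^n . \tag{3} $$

In terms of these GFs, Eqs. (1) and (2) can be rewritten as:

$$ \tilde{N}(x;t+1) = \left[ 1 + \frac{\Gamma_s}{t}\,x\,(x-1)\,\frac{d}{dx} \right] \tilde{N}(x;t) + q\,x ; \tag{4} $$

$$ \tilde{S}(x;t+1) = \left[ 1 + \frac{\Omega_s}{t}\,x\,(x-1)\,\frac{d}{dx} \right] \tilde{S}(x;t) + p\,x^s , \tag{5} $$

where we have also introduced

$$ \Gamma_s = \frac{1 - q + p(s-1)}{1 + p(s-1)} \quad \text{and} \quad \Omega_s = \frac{1 - p}{1 + p(s-1)} . \tag{6} $$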



A similar description can be obtained in terms of the
corresponding probability generating functions (PGFs), $N(x;t)$
and $S(x;t)$, which generate the distributions of memberships
per element and of sizes per structure, respectively. To transform
the previous description in terms of these PGFs, note that the
mean number of elements, $\tilde{N}_m$, or structures, $\tilde{S}_n$, in a given
state corresponds to the proportion of such elements, $N_m$, or
structures, $S_n$, multiplied by the mean total number of
elements, qt, or structures, pt, expected at time t. One can now
rewrite Eqs. (4) and (5) in terms of $N(x;t)$ and $S(x;t)$ by
multiplying these functions by qt and pt, respectively:

$$ (t+1)\,N(x;t+1) = \left[ t + \Gamma_s\,x\,(x-1)\,\frac{d}{dx} \right] N(x;t) + x ; \tag{7} $$

$$ (t+1)\,S(x;t+1) = \left[ t + \Omega_s\,x\,(x-1)\,\frac{d}{dx} \right] S(x;t) + x^s . \tag{8} $$

As we will see in what follows, the description in terms of
PGFs is generally more useful and will hereafter be used in
our results to validate the analytical description.



B. Degree distributions 

PGFs provide simple ways to evaluate secondary properties
of a given state, for example the node degree distribution and
the community degree distribution. The former describes how
many elements can be reached from a randomly chosen element,
in other words, the number of links connected to this
node in the network representation. The latter refers to a similar
concept, namely, the number of structures that overlap (by
sharing elements) with one randomly chosen structure.

To illustrate how this calculation is performed, one can simply
refer to the composition property of PGFs. We first pick a
random element whose membership distribution is generated
by $N(x;t)$. For every possible value of its membership number
m, we sum over all possible cases for the different sizes
of these structures. However, we know that all of these m
structures have at least one element. It is thus k times more
likely that one of these m structures is a structure of size k
than a structure of size one. Furthermore, we do not want to
count the initial element we chose, and will thus reduce the
size of each structure by one. Hence, their size distribution is
not generated by $S(x;t)$, but instead by $S'(x;t)/S'(1;t)$, where
the denominator acts as a normalisation factor. Knowing that
the convolution of two sequences is generated by the product
of the corresponding PGFs, one can take the m-th power of the
new size PGF to obtain the PGF for the sum of m structures.
Finally, we sum over all possible values of m to obtain [7]:

$$ D(x;t) = \sum_m N_m(t) \left[ \frac{S'(x;t)}{S'(1;t)} \right]^m = N\!\left( \frac{S'(x;t)}{S'(1;t)} ; t \right) . \tag{9} $$

FIG. 2: (color online). Time evolution of node-based SPA (s = 1) using q = 0.35 and p = 0.65 for the four main characteristics of the topology:
memberships, community size, node degree and community degree. Snapshots are taken when the systems reach (a) 250 structures, (b) 1000
structures and (c) 25 000 structures. Shown by markers are Monte Carlo results averaged over 25 000 simulations; analytical predictions of
Eqs. (7)-(10) are shown with continuous lines.

FIG. 3: (color online). Convergence of the time evolution governed by Eqs. (7)-(10) toward the equilibrium predicted by Eq. (15) for (a)
the membership distribution, (b) the size distribution, (c) node degree distribution and (d) community degree distribution in node-based SPA
(s = 1) using q = 0.6 and p = 0.25.

Using the same logic for structures and their community
degree, one can write:

$$ C(x;t) = S\!\left( \frac{N'(x;t)}{N'(1;t)} ; t \right) . \tag{10} $$

The self-similarity between different levels of organization
in the systems created by SPA stems from the similarity
between Eqs. (9) and (10). As long as $N(x;t)$ and $S(x;t)$ are
similar, the various possible compositions, which represent
different organization properties, will also be similar.
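
The composition in Eqs. (9) and (10) can be carried out on truncated coefficient lists. The sketch below is a hedged illustration: the names `truncated_product` and `compose_with_excess` and the cutoff `k_max` are assumptions, and the result inherits the homogeneous mixing approximation of the text. Passing the membership distribution as `outer` and the size distribution as `inner` gives the node degree distribution; swapping the two gives the community degree distribution.

```python
def truncated_product(a, b, k_max):
    """Coefficients of a(x) * b(x), keeping powers up to x**k_max."""
    out = [0.0] * (k_max + 1)
    for i, ai in enumerate(a):
        if ai == 0.0:
            continue
        for j, bj in enumerate(b):
            if i + j > k_max:
                break
            out[i + j] += ai * bj
    return out


def compose_with_excess(outer, inner, k_max):
    """Coefficients of outer(inner'(x) / inner'(1)) up to x**k_max.

    outer[m] and inner[n] are probabilities indexed by m and n;
    inner must have at least one nonzero entry with n >= 1.
    """
    # Coefficients of inner'(x): weight n * inner[n] on the power n - 1.
    excess = [n * inner[n] for n in range(1, len(inner))]
    norm = sum(excess)                      # this is inner'(1)
    excess = [c / norm for c in excess]
    result = [0.0] * (k_max + 1)
    power = [1.0] + [0.0] * k_max           # excess(x) ** 0
    for weight in outer:
        result = [r + weight * c for r, c in zip(result, power)]
        power = truncated_product(power, excess, k_max)
    return result
```

For instance, if every structure has size two and elements have one or two memberships with equal probability, each membership contributes exactly one neighbor, so the degree distribution equals the membership distribution.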

The validation of our analytical description for the time
evolution of SPA is presented in Fig. 2 using Monte Carlo
simulations. The initial conditions of all systems (i.e., the
state of the system at t = 0), in both numerical simulation and
analytical integration, consist of a single structure containing
a single element; this remains true throughout the paper. Note
that our calculations for the degree distributions are merely
approximations because they suppose homogeneous mixing
between elements and structures, while an element with m = i
might not see exactly the same size distribution as an element
with m = j. Such element-structure correlations are investigated
in Sec. III C.



C. Statistical equilibrium 

The statistical equilibrium can be imposed by setting
$N(x;t+1) = N(x;t) = N^*(x)$ and $S(x;t+1) = S(x;t) = S^*(x)$
in Eqs. (7) and (8), yielding:

$$ N^*(x) = \Gamma_s\,x\,(x-1)\,\frac{d}{dx} N^*(x) + x ; \tag{11} $$

$$ S^*(x) = \Omega_s\,x\,(x-1)\,\frac{d}{dx} S^*(x) + x^s . \tag{12} $$

FIG. 4: (color online). Validation of Eqs. (18) and (19) as predictions
for the asymptotic scaling behaviors of the main statistical distributions
(dashed lines: steady-state solutions, continuous line: scaling
predictions) for node-based SPA (s = 1) using q = 0.6 ($\gamma_N = 7/2$)
and p = 0.25 ($\gamma_S = 7/3$). [Curves shown: community degree, membership,
degree and size; horizontal axis: quantity.]



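These ordinary differential equations can be solved
straightforwardly to obtain their solutions in terms of hypergeometric
functions of the form ${}_2F_1(a, b; c; x)$:

$$ N^*(x) = \frac{x}{1 + \Gamma_s}\; {}_2F_1\!\left( 1, 1; 2 + \frac{1}{\Gamma_s}; x \right) \tag{13} $$

and:

$$ S^*(x) = \frac{x^s}{1 + s\,\Omega_s}\; {}_2F_1\!\left( s, 1; (s+1) + \frac{1}{\Omega_s}; x \right) . \tag{14} $$

The statistical equilibrium for the two distributions of interest
can now be obtained through the power series coefficients of
these two functions:

$$ N^*_m = \frac{\prod_{k=1}^{m-1} k\,\Gamma_s}{\prod_{k=1}^{m} (1 + k\,\Gamma_s)} \quad \text{and} \quad S^*_n = \frac{\prod_{k=s}^{n-1} k\,\Omega_s}{\prod_{k=s}^{n} (1 + k\,\Omega_s)} . \tag{15} $$

These solutions for the asymptotic behavior of the statistical
distributions can be validated through comparison with the
long term behavior of our predicted time evolution, as done
in Fig. 3.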
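
The products in Eq. (15) are easiest to evaluate through the one-step recursion they imply, namely a first term 1/(1 + base × rate) followed by a factor (k - 1) rate / (1 + k rate) at each step. The sketch below assumes this reading of Eq. (15); the function name `steady_state` and the cutoff `k_max` are illustrative choices.

```python
def steady_state(rate, base=1, k_max=1000):
    """Steady-state distribution of Eq. (15), truncated at k_max.

    Use rate = Gamma_s with base = 1 for memberships, and
    rate = Omega_s with base = s for structure sizes.
    """
    dist = [0.0] * (k_max + 1)
    dist[base] = 1.0 / (1.0 + base * rate)
    for k in range(base + 1, k_max + 1):
        # Ratio of successive terms of the two products in Eq. (15).
        dist[k] = dist[k - 1] * (k - 1) * rate / (1.0 + k * rate)
    return dist
```

The truncated list should sum to nearly one when the cutoff is large, and its local logarithmic slope in the tail gives a numerical estimate of the scaling exponent.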

D. Scaling behavior 

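From PA, it is well known that the $N^*_m$ and $S^*_n$ distributions
will fall as power laws, i.e.,

$$ N^*_m \propto m^{-\gamma_N} \quad \text{and} \quad S^*_n \propto n^{-\gamma_S} . \tag{16} $$

To calculate the scaling exponent $\gamma_N$, we can evaluate the
following ratio using Eq. (15):

$$ \lim_{m\to\infty} \frac{N^*_m}{N^*_{m-1}} = \lim_{m\to\infty} \left( \frac{m}{m-1} \right)^{-\gamma_N} = \lim_{m\to\infty} \frac{(m-1)\,\Gamma_s}{1 + m\,\Gamma_s} \tag{17} $$

from which it follows that

$$ \gamma_N = \lim_{m\to\infty} \frac{\log\left[ (m-1)\,\Gamma_s / (1 + m\,\Gamma_s) \right]}{\log\left[ (m-1)/m \right]} = 1 + \frac{1}{\Gamma_s} . \tag{18} $$

Similarly, one can directly write for structures:

$$ \gamma_S = 1 + \frac{1}{\Omega_s} . \tag{19} $$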



The node and community degree distributions, as compositions
of two power-law distributions, will fall as the slower of
the two original distributions. Noting that $N'(x;t)$ and $S'(x;t)$
will follow $\gamma_{N'} = \gamma_N - 1$ and $\gamma_{S'} = \gamma_S - 1$ because of the
derivative, we obtain:

$$ \gamma_D = \min\{\gamma_N,\, \gamma_S - 1\} \quad \text{and} \quad \gamma_C = \min\{\gamma_N - 1,\, \gamma_S\} . \tag{20} $$

These results are validated in Fig. 4.
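
The four scaling exponents follow from q, p and s alone through Eqs. (6) and (18)-(20). A small helper makes this explicit; the function name `scaling_exponents` and the use of exact fractions are illustrative assumptions, not part of the original analysis.

```python
from fractions import Fraction


def scaling_exponents(q, p, s=1):
    """Exponents of Eqs. (18)-(20) from the rates of Eq. (6).

    q and p may be strings such as "0.6" so that the arithmetic stays
    exact. The rates must be nonzero (this excludes q = 1 with s = 1,
    and p = 1).
    """
    q, p = Fraction(q), Fraction(p)
    norm = 1 + p * (s - 1)
    gamma_rate = (1 - q + p * (s - 1)) / norm    # Gamma_s
    omega_rate = (1 - p) / norm                  # Omega_s
    g_n = 1 + 1 / gamma_rate                     # memberships, Eq. (18)
    g_s = 1 + 1 / omega_rate                     # sizes, Eq. (19)
    return {
        "N": g_n,
        "S": g_s,
        "D": min(g_n, g_s - 1),                  # node degree, Eq. (20)
        "C": min(g_n - 1, g_s),                  # community degree, Eq. (20)
    }
```

With the parameters of Fig. 4 (q = 0.6, p = 0.25, s = 1), this returns 7/2 for memberships and 7/3 for sizes, matching the values quoted in that caption.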

III. APPROXIMATIONS AND LIMITATIONS 

To complete our description of the SPA process, this section
examines some approximations that have either proven useful
when reproducing empirical data with the SPA process or that
correspond to limitations of the present formalism.

A. Correspondence between system bases 

Some systems reproduced in [1] with node-based SPA (s = 1)
are actually link-based, for example the author collaboration
network of the cond-mat arXiv, where authors only appear
once they have at least one collaboration. The link between
node-based and link-based SPA is made by ignoring structures
of size one when compiling the final system.

In [1], we mentioned that the system base s was not a
parameter of the model per se, but depends on the information
available or on the nature of the system. For instance, the
World-Wide Web is mapped by following links between webpages,
such that it is impossible to find a page with no links.
The smallest structural unit is thus the link and not the webpage
itself: it is a link-based system (s = 2). Similarly, the
author collaboration network of the cond-mat arXiv is built
through collaborations and thus excludes authors without any
links. Despite this fact, it can be modelled through node-based
SPA by ignoring structures of size one at the very end of the
process. Furthermore, structures of size one can rarely be
detected in network data if they are not completely disconnected
from the rest of the system. Hence, it is useful to be able
to ignore these structures at the end of the stochastic growth
process, independently of the system base.

For the size distribution, ignoring structures of size one
simply implies a renormalization for structures of size two or
greater. Noting the PGF for an approximate link-based SPA
$S_2^{\mathrm{app}}(x)$ using the original node-based function $S_1(x)$, we can
write:

$$ S_2^{\mathrm{app}}(x) = \frac{S_1(x) - S_1'(0)\,x}{1 - S_1'(0)} , \tag{21} $$

where $S_1'(0)$ is the fraction of structures of size one.

FIG. 5: (color online). (a) and (b) Analytical predictions (lines) and simulations (markers) for (a) an approximation of a link-based system using
a node-based SPA process and (b) the link-based SPA using the same parameters as in (a). (c) Community structure of the cond-mat arXiv as
measured by a link community algorithm [5] (dots) and as modelled by a link-based SPA ($q_2 = 0.95$; $p_2 = 0.39$) in continuous lines or by a
node-based SPA which approximates the link-based system ($q_1 = 0.68$ and $p_1 = 0.56$ according to Eq. (24)) in dashed lines. The two black
lines perfectly overlap, while the node-based membership distribution is slightly shifted by the use of approximation (22).
For the membership distribution, once again assuming
homogeneous mixing, we must randomly remove the fraction of
memberships which corresponds to the structures of size one.
Using the composition of PGFs, this can be done by composing
the membership PGF with the PGF for a binomial trial:

$$ N_2^{\mathrm{app}}(x) = \frac{N_1\big(x(1-\epsilon) + \epsilon\big) - N_1(\epsilon)}{1 - N_1(\epsilon)} , \tag{22} $$

where $N_1(\epsilon)$ corresponds to the elements left with no
memberships, which thus need to be removed from the system. This
trial will remove a fraction $\epsilon$ of memberships, where $\epsilon$
corresponds to the fraction of memberships which are associated
with structures of size one:

$$ \epsilon = \frac{S_1'(0)}{S_1'(1)} . \tag{23} $$
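
The approximation described above can be applied to coefficient lists directly. The sketch below is an illustration under assumptions: the function name `approximate_link_based` is hypothetical, the fraction of memberships held by structures of size one is computed as the weight of size-one structures divided by the mean size, and the binomial trial is expanded term by term rather than through PGF composition.

```python
from math import comb


def approximate_link_based(N1, S1):
    """Approximate link-based distributions from node-based ones.

    N1[m] and S1[n] are node-based membership and size distributions,
    with N1[0] == S1[0] == 0 and len(S1) >= 2.
    Returns (N2, S2, eps).
    """
    mean_size = sum(n * S1[n] for n in range(len(S1)))
    eps = S1[1] / mean_size            # memberships tied to size-one structures
    # Size distribution: drop size one and renormalize.
    S2 = [0.0, 0.0] + [S1[n] / (1.0 - S1[1]) for n in range(2, len(S1))]
    # Membership distribution: each membership is removed with
    # probability eps; elements left with none are discarded.
    lost = sum(N1[m] * eps ** m for m in range(len(N1)))
    N2 = [0.0] * len(N1)
    for m in range(1, len(N1)):
        for j in range(1, m + 1):
            N2[j] += N1[m] * comb(m, j) * (1 - eps) ** j * eps ** (m - j)
    N2 = [v / (1.0 - lost) for v in N2]
    return N2, S2, eps
```

Both returned distributions are normalized, which is a useful sanity check when feeding in Monte Carlo or analytical node-based results.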



The validity of this approximate description and the effects of
switching between system bases are illustrated in Fig. 5. Note
how changing the system base, while keeping the parameters
constant, greatly modifies the produced system. This
highlights both the validity of Eqs. (21) and (22) (which feature
two levels of approximation of homogeneous mixing) and the
importance of considering the influence of the system base on
the scaling behavior.

To compare the results of approximated and actual link-based
SPA for the same community structure, we first need
to identify the relation between the parameter pairs $\{q_1, p_1\}$
and $\{q_2, p_2\}$ which is such that $\Gamma_1 = \Gamma_2$ and $\Omega_1 = \Omega_2$. From
Eq. (6), we obtain:

$$ p_2 = \frac{p_1}{2 - p_1} \quad \text{and} \quad q_2 = \frac{2\,q_1}{2 - p_1} . \tag{24} $$
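
The mapping of Eq. (24) and its inverse are simple to encode. In the sketch below, the function names are hypothetical and `link_to_node` is obtained by inverting Eq. (24) algebraically; `rates` evaluates the two rates of Eq. (6) so that the equalities between node-based and link-based rates can be checked numerically.

```python
def node_to_link(q1, p1):
    """Eq. (24): link-based (q2, p2) matching node-based (q1, p1)."""
    return 2 * q1 / (2 - p1), p1 / (2 - p1)


def link_to_node(q2, p2):
    """Inverse of Eq. (24), obtained by solving it for (q1, p1)."""
    return q2 / (1 + p2), 2 * p2 / (1 + p2)


def rates(q, p, s):
    """Membership and size growth rates of Eq. (6) for system base s."""
    norm = 1 + p * (s - 1)
    return (1 - q + p * (s - 1)) / norm, (1 - p) / norm
```

Applying `link_to_node(0.95, 0.39)` gives values that round to 0.68 and 0.56, the node-based parameters quoted in the caption of Fig. 5.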



While it is easily verified that ignoring structures of size
one in node-based SPA can result in statistical features similar
to those of link-based SPA (see Fig. 5(c)), there exists one
particularly important structural difference between these two
kinds of systems. Namely, a true link-based system is necessarily
fully connected as each new element creates at least
one link with the old elements, while node-based systems can
create many disconnected components that may or may not
end up interconnecting through new structures (depending on
q and p). In real link-based systems, there is no restriction on
connectedness. For instance, the cond-mat arXiv network of
co-authors has one giant component which consists of ~93%
of the system, but other smaller satellite components still exist.
While both SPA versions illustrated in Fig. 5(c) create a
community structure similar to that of the cond-mat arXiv, the
node-based version is actually closer to reality.



B. Multiple memberships, multiple links and self-loops 

In our description of the time evolution of SPA, we have
never explicitly forbidden an element to join the same structure
more than once. These multiple memberships, whose
likelihood depends directly on the value of the p or q
parameters, lead to multiple links between the same individuals
and self-loops (where an element shares a structure with
itself). Similarly, in our derivation of the degree distributions,
we have supposed an infinite system where the probabilities
that two structures overlap by more than one element fall to
zero.

In empirical data, multiple links and self-loops are rarely
considered. It can thus be useful to have an idea of the effect of
such restrictions on SPA. Fig. 6 presents two snapshots of the
same scenarios of SPA, with or without forbidding multiple
memberships, multiple links and self-loops when analyzing
the final stage of the system. The cutoffs in the distributions
of the first system are not surprising, as large and old structures
are very likely to have recruited the same element more
than once, especially with a small q. Yet, this effect rapidly
becomes negligible as the system grows and we enter the large
size limit in accordance with the assumptions of our analytical
description (see Fig. 6(b)).

FIG. 6: (color online). (a) and (b) Comparison between the time evolution data presented in Fig. 2 (dots) and the same data when multiple
memberships, multiple links and self-loops are discarded (lines) for systems with (a) 250 structures and (b) 25 000 structures. Multiple
memberships, multiple links and self-loops are finite size effects which become negligible in the large-size limit.

FIG. 7: (color online). Size distribution of structures as seen from
elements with different m memberships. Markers represent empirical
measures done on the cond-mat arXiv and numerical results on the
two SPA processes (using the parameters of Fig. 5(c)). The dashed
line corresponds to what would be obtained through homogeneous
pairing of memberships and structures.



C. Element-structure correlations

Most of the approximations used throughout this paper are
based on the assumption of homogeneous mixing: the elements
belonging to a number x of structures see the same size
distribution as the elements belonging to y structures. This
implies that there are no correlations except for the fact that an
element is x times more likely to belong to a given structure of
size x than to a particular structure of size one (natural
correlations). To investigate this matter, we compare the size
distributions as seen from elements with different memberships in
both the simulations done for Fig. 5(c) and the corresponding
arXiv data.

Figure 7 presents the results of this investigation. First, the
similarity between SPA and homogeneous mixing explains
why our approximations were accurate. The small difference
between the node-based and link-based SPA processes is most
likely due to the fact that the link-based version requires more
elements for the birth of new structures, which are consequently
more likely to be old elements than in the node-based
version. Second, there is a major difference between
element-structure correlations in real systems and SPA: elements with
few memberships are much more likely to belong to larger
structures in the arXiv data than in our SPA simulations. This
shows how other levels of organization have yet to be taken
into account in our stochastic models. Depending on what
one wants to model, these correlations could potentially be
important.



IV. PELOTON DYNAMICS



One particularly interesting feature of the results presented
in Figs. 2 and 3 is the dynamics of the entities in the tail
of the distributions. In fact, these groups of individuals or
structures resulted in clearly identifiable bulges on their
respective distributions. The dynamics of a system's leader is
well-documented in the context of growing networks [8, 9]
or word frequencies [10], but can be applied to any problem
where one is interested in the statistics of the extremes (i.e.,
the growth of the biggest business firm, of the most popular
website, etc.). What we observe here is that averaging over
multiple realizations of the same experiment will result in the
creation of a peloton where one is significantly more likely to
find entities than predicted by the asymptotic distribution (i.e.,
the leaders).

FIG. 8: (color online). (a) Comparison of the membership and size distributions of node-based SPA with q = 0.8 and p = 0.2 in discrete and
continuous dynamics at time t = 100. This illustrates how the peloton dynamics is a direct consequence of the maximal system size present
only in the discrete version of the process. (b) The height of the peloton follows a power-law decay (here for the results of Fig. 3(b)), such
that its surface is conserved on a logarithmic scale as it evolves. The decay exponent of the peloton is the same as the scaling exponent of the
distribution it creates. (c) Rescaled distribution $n^{\gamma_S} S_n(t)$ as a function of rescaled community size $n/t^{1-p}$ highlights the scaling of the peloton
dynamics.

FIG. 9: (color online). Distributions of words by their number of occurrences in prose samples of different length taken from the complete
works of (a) H.P. Lovecraft composed of nearly 800 000 words, (b) William Shakespeare with around 900 000 words and (c) Herman Melville
with over 1 200 000 words. The peloton dynamics is manifest in all distributions. (d) The rescaling method of Fig. 8(c), with $\gamma = 2.27$ and
1 - p = 0.43, is applied to the statistics of Herman Melville's work.

The clear distinction between the statistical distribution of
leaders versus the rest of the system is a consequence of the
maximal size of the system and of the limited growth resources
available. To illustrate this claim, we can consider
a continuous time version of PA in which there is no finite
limitation to the number of growth events at every time step
(see Appendix A for an explicit solution of this process).
Comparing the results of the discrete and continuous versions of
our stochastic process in Fig. 8(a) illustrates how limiting
growth resources results in the condensation of the leaders
in a peloton. This draws a strong parallel between discrete
preferential attachment and some sandpile models known to
result in scale-free avalanche size distributions through
self-organized criticality. In some cases, such as the Oslo model
(see [11] §3.9), the biggest avalanches are limited by the size
of the considered sandpile and are thus condensed in bulges
identical to our pelotons.

Also striking is the fact that this peloton conserves its shape
on a log-log scale (see Fig. 8(b)). To highlight this feature,
Fig. 8(c) rescales the distributions to account for the scaling
in size ($\gamma_S$) and the peloton growth through time ($t^{1-p}$, see
Appendix B for derivation). This rescaling method was borrowed
from [11] §3.9.8.

Leaders emerge in every single preferential growth realization,
while the peloton dynamics can only manifest itself once
we average over multiple systems or over many characteristic
time scales of a single system (through the births and deaths
of many different leaders). Consequently, empirical observations
of this phenomenon are rare, because on the one hand
we have only one Internet, one arXiv, and basically a unique
copy of most complex systems, and on the other hand, we
rarely have access to extensive data through long time scales.
We can however find a solution if we go back to the first
example used by Simon [2] to derive his model: the scale-free
distribution of words by their number of occurrences in written
text (i.e., Zipf's law [12]). In this context, q equals zero
and the p parameter corresponds to the probability that each
new written word has never been used before. We can therefore
consider different samples of text of equal length written
by the same author as different realizations of the same
experiment.
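
The averaging procedure described above can be sketched as follows. This is an illustrative reading, not the authors' analysis script: the function name `averaged_occurrence_distribution` is hypothetical, the text is assumed to be already tokenized into a list of words, and any leftover words beyond the last full sample are ignored.

```python
from collections import Counter


def averaged_occurrence_distribution(words, sample_length):
    """Mean number of distinct words per occurrence count.

    The word list is cut into consecutive samples of equal length;
    each sample is treated as one realization, and the histograms of
    occurrence counts are averaged over samples.
    """
    n_samples = len(words) // sample_length
    if n_samples == 0:
        raise ValueError("text is shorter than one sample")
    total = Counter()
    for i in range(n_samples):
        sample = words[i * sample_length:(i + 1) * sample_length]
        occurrences = Counter(sample)              # word -> occurrences
        total.update(Counter(occurrences.values()))  # occurrences -> words
    return {k: total[k] / n_samples for k in sorted(total)}
```

Plotting the returned values against their keys on a log-log scale gives the averaged distribution of words by number of occurrences for a given sample length.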

With this in mind, we have picked different authors according
to personal preferences and size of their body of work and
divided their oeuvres in samples of given lengths which we
then used to evaluate Zipf's law under averaging (see Fig. 9).
As predicted by PA, taking the average of multiple realizations
of the same experiment results in a peloton which diverges
from the traditional Zipf's law. In this case, the peloton
implies that the leaders of this system (i.e., the most frequent
words) consistently fall in the same scale of occurrences.
Lastly, Fig. 9(d) reproduces the scaling analysis of Fig. 8(c)
for empirical results on prose samples. The varying surface of
the peloton hints at a non-constant growth rate: a well-known
feature of written text (see [13] §7.5).



V. CONCLUSION 

In this paper, several analytical results for structural pref- 
erential attachment have been obtained: solutions for its time 
evolution and asymptotic behavior as well as approximations 
for its different degree distributions. Those approximate de- 
scriptions are especially useful when it comes to using orga- 
nization models as part of modelling efforts. 

We have also highlighted one particular shortcoming of the 
model: element-structure correlations. That is, SPA lacks any 
modelling or predictive power when it comes to asking who 
belongs to what structure. 

On the other hand, we have observed an interesting behav- 
ior of both the SPA and the classic PA models: the peloton 
dynamics. This particular feature is important in order to pre- 
dict the position of the leaders of a PA growth process. More 
interestingly, we have been able to observe this behavior in 
the growth of prose samples, which differentiates the PA prin- 
ciple from the other models generating scale-free designs but 
failing to predict this property. 

The presentation of shortcomings and successes of the SPA 
principle (in terms of predictive value) shows the importance 
and the need for further study in stochastic growth models. 
Appendix A: Explicit solution to continuous time SPA

Section IV has presented an explicit solution for the time
evolution of SPA in continuous time. This Appendix summarizes
its derivation, based on a recently proposed method [14].



1. Definition of a continuous time PA process

The transition to continuous time simply implies that q and
p now refer to birth rates for both elements and structures. The
corresponding rates 1 - q and 1 - p thereby correspond to the
growth rates of existing elements and structures, respectively.
This means that in a given time interval [t, t + 1], this new
stochastic process could create an infinite number of elements
with probability $\lim_{dt \to 0} (q\,dt)^{1/dt}$, whereas the discrete version
could only create one element with probability q. While it is
highly improbable that continuous time PA results in a system
several orders of magnitude larger than qt or pt, there is no
maximal size per se.

This sort of continuous time dynamics is better described
using simple ODEs, or master equations, as was done in [1].
To this end, we once again follow $N_m$, the number of elements
with m memberships, and $S_n$, the number of structures enclosing
n elements. Using the same logic behind Eqs. (1) and (2),
but considering infinitesimal time steps dt, one can write

$$ N_m(t + dt) = N_m(t) + dt \left\{ \frac{\Gamma_s}{t} \left[ (m-1)\,N_{m-1}(t) - m\,N_m(t) \right] + q\,\delta_{m,1} \right\} $$

and

$$ S_n(t + dt) = S_n(t) + dt \left\{ \frac{\Omega_s}{t} \left[ (n-1)\,S_{n-1}(t) - n\,S_n(t) \right] + p\,\delta_{n,s} \right\} , $$

which are straightforwardly rewritten as two ODEs:

$$ \frac{d}{dt} N_m(t) = \frac{\Gamma_s}{t} \left[ (m-1)\,N_{m-1}(t) - m\,N_m(t) \right] + q\,\delta_{m,1} ; \tag{A1} $$

$$ \frac{d}{dt} S_n(t) = \frac{\Omega_s}{t} \left[ (n-1)\,S_{n-1}(t) - n\,S_n(t) \right] + p\,\delta_{n,s} . \tag{A2} $$

Because these last two equations have the same form, we solve
them separately using a general continuous time PA equation.
Consider

$$ \frac{d}{dt} P_k(t) = \beta\,\delta_{k,m} + R_{k-1}(t)\,P_{k-1}(t) - R_k(t)\,P_k(t) \tag{A3} $$

where $\beta$ is the birth rate, m is the size of new entities and $R_i(t)$
is the attachment rate on entities of size i, which we define
using a growth rate a, an initial total size $m_0$ and a normalization
rate A:

$$ R_i(t) = \frac{a\,i}{m_0 + A\,t} . \tag{A4} $$

It proves useful to rewrite (A3) in dimensionless form as

$$ \frac{d}{d\tau} P_k(\tau) = \bar{\beta}\,\delta_{k,m} + \bar{R}_{k-1}(\tau)\,P_{k-1}(\tau) - \bar{R}_k(\tau)\,P_k(\tau) \tag{A5} $$

with dimensionless time $\tau = at$, parameters $\bar{\beta} = \beta/a$ and
$\bar{A} = A/a$, and attachment rate $\bar{R}_k(\tau) = k/(m_0 + \bar{A}\tau)$.
Table I gives the values of the different parameters for the
classical PA models and for SPA.





                                    PA                    SPA
                                    Simon       BA        elements                structures
$\bar{\beta} = \beta/\alpha$        $p/(1-p)$   $1/m$     $q/\alpha$              $p/\alpha$
$\alpha$                            $1-p$       $m$       $1-q+p(s-1)$            $1-p$
$\bar{\Lambda} = \Lambda/\alpha$    $1/(1-p)$   $2$       $[1+p(s-1)]/\alpha$     $[1+p(s-1)]/\alpha$
$m$                                 $1$         $m$       $1$                     $s$

TABLE I: Parameters of the general PA process [Eq. (A5)] in the
context of Simon's model [2], of the Barabási-Albert model (BA) [3]
and of SPA.
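The parameter table can be encoded as a small lookup function, which is convenient when the same solver is applied to different models. The sketch below is only an illustration: the function name `pa_parameters` and the model labels are hypothetical, and the value $\alpha = 1 - p$ used for Simon's model is inferred here from the two ratios $\beta/\alpha = p/(1-p)$ and $\Lambda/\alpha = 1/(1-p)$ rather than read directly from the table.

```python
def pa_parameters(model, p=None, q=None, s=1, m=1):
    """Return (beta_bar, alpha, lambda_bar, m) for Eq. (A5), following Table I.

    model is one of the hypothetical labels
    "simon", "ba", "spa_elements", "spa_structures".
    """
    if model == "simon":
        alpha = 1.0 - p                 # inferred from the two ratios in Table I
        return p / alpha, alpha, 1.0 / alpha, 1
    if model == "ba":
        return 1.0 / m, float(m), 2.0, m
    lam = 1.0 + p * (s - 1)             # normalization rate shared by both SPA columns
    if model == "spa_elements":
        alpha = 1.0 - q + p * (s - 1)
        return q / alpha, alpha, lam / alpha, 1
    if model == "spa_structures":
        alpha = 1.0 - p
        return p / alpha, alpha, lam / alpha, s
    raise ValueError(f"unknown model: {model}")
```

For instance, `pa_parameters("spa_structures", p=0.2, s=1)` returns the same values as `pa_parameters("simon", p=0.2)`, as expected since SPA structures with $s = 1$ follow the same column entries as Simon's model.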



2. Explicit solution 



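Let

$$ H_k(\tau) = \exp\left[ \int R_k(\tau)\, d\tau \right] = (m_0 + \bar{\Lambda}\tau)^{k/\bar{\Lambda}} \tag{A6} $$

so that Eq. (A5) can be written as:

$$ \frac{d}{d\tau} \left[ P_k(\tau)H_k(\tau) \right] = \bar{\beta} H_k(\tau)\,\delta_{k,m} + R_{k-1}(\tau)H_k(\tau)P_{k-1}(\tau) . \tag{A7} $$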
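The general solution of this transformed equation is:

$$ P_k(\tau) = \bar{\beta}\, \frac{m_0 + \bar{\Lambda}\tau}{k + \bar{\Lambda}}\, \delta_{k,m} + \frac{1 - \delta_{k,m}}{H_k(\tau)} \int R_{k-1}(\tau)H_k(\tau)P_{k-1}(\tau)\, d\tau + \frac{C_k}{H_k(\tau)} \tag{A8} $$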
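As a quick check of the general solution, the case $k = m$ can be worked out explicitly. The steps below are a sketch that assumes the dimensionless rate $R_k(\tau) = k/(m_0 + \bar{\Lambda}\tau)$, so that the integrating factor is $H_m(\tau) = (m_0 + \bar{\Lambda}\tau)^{m/\bar{\Lambda}}$, with overbars denoting the dimensionless parameters of Eq. (A5).

```latex
\begin{align}
% k = m in Eq. (A7): only the birth term survives
\frac{d}{d\tau}\left[ P_m(\tau) H_m(\tau) \right]
  &= \bar{\beta}\,(m_0 + \bar{\Lambda}\tau)^{m/\bar{\Lambda}} \\
% integrate once with respect to \tau
P_m(\tau) H_m(\tau)
  &= \bar{\beta}\,\frac{(m_0 + \bar{\Lambda}\tau)^{m/\bar{\Lambda}+1}}{m + \bar{\Lambda}} + C_m \\
% divide by the integrating factor H_m(\tau)
P_m(\tau)
  &= \bar{\beta}\,\frac{m_0 + \bar{\Lambda}\tau}{m + \bar{\Lambda}}
   + C_m\,(m_0 + \bar{\Lambda}\tau)^{-m/\bar{\Lambda}}
\end{align}
```

The first term is the linear growth driven by births and the second is the decaying memory of the initial condition, which is the structure repeated at every higher $k$.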



where {Ck} are constants of integration determined by the ini- 
tial conditions. Solving for the first few values of k (m, m+ 1, 
m + 2, . . .) reveals the following pattern for the solutions: 



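$$ P_{m+k}(\tau) = \bar{\beta}\, \frac{(m)_k}{(m+\bar{\Lambda})_{k+1}}\, (m_0 + \bar{\Lambda}\tau) + \sum_{i=0}^{k} \frac{(m)_k\, C_{m+i}}{(m)_i\, \Gamma(k-i+1)}\, (m_0 + \bar{\Lambda}\tau)^{-(m+i)/\bar{\Lambda}} \tag{A9} $$

where $(\gamma)_j = \gamma(\gamma+1)\cdots(\gamma+j-1)$ are Pochhammer symbols.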
The last step towards a complete solution is to determine an
explicit form of the constants of integration $\{C_{m+k}\}$ in terms
of the initial conditions $\{P_{m+k}(0)\}$. This is easily accomplished
by writing (A9) in a matrix form for the vector of initial
conditions $\mathbf{P}(0)$

$$ \mathbf{P}(0) = \mathbf{A}(0) + \mathbf{L}(0)\mathbf{C} \tag{A10} $$

in terms of the vector $\mathbf{C}$ of integration constants and a
lower triangular matrix $\mathbf{L}$ (with $\mathbf{A}(\tau)$ the vector
of the terms of (A9) that do not involve the constants), followed by
the observation that the inverse of a (lower/upper) triangular matrix
is also a (lower/upper) triangular matrix whose elements can be
constructed by forward substitution. Given that the elements of
$\mathbf{L}(0)$ are

$$ L_{m+k,m+i}(0) = \binom{m+k-1}{m+i-1}\, m_0^{-(m+i)/\bar{\Lambda}} \tag{A11} $$

we find that the elements of the inverse matrix, denoted $\mathbf{M}$,
are simply

$$ M_{m+k,m+i} = (-1)^{k-i} \binom{m+k-1}{m+i-1}\, m_0^{(m+k)/\bar{\Lambda}} . \tag{A12} $$

Inserting this solution in (A9), we get

$$ \mathbf{P}(\tau) = \left[ \mathbf{A}(\tau) - \mathbf{L}(\tau)\mathbf{M}\mathbf{A}(0) \right] + \mathbf{L}(\tau)\mathbf{M}\mathbf{P}(0) , \tag{A13} $$
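The inversion of the triangular matrix can be checked numerically on a small truncation. The sketch below is an illustration under stated assumptions: Eq. (A11) is read as $L_{m+k,m+i}(0) = \binom{m+k-1}{m+i-1} m_0^{-(m+i)/\bar{\Lambda}}$ and Eq. (A12) as $M_{m+k,m+i} = (-1)^{k-i}\binom{m+k-1}{m+i-1} m_0^{(m+k)/\bar{\Lambda}}$, with integer $m \geq 1$. The helper names are hypothetical.

```python
import math

def build_L0_and_M(m, m0, lam_bar, size):
    """Truncated matrices of Eqs. (A11) and (A12), indexed by k, i = 0..size-1."""
    L0 = [[0.0] * size for _ in range(size)]
    M = [[0.0] * size for _ in range(size)]
    for k in range(size):
        for i in range(k + 1):  # both matrices are lower triangular
            binom = math.comb(m + k - 1, m + i - 1)
            L0[k][i] = binom * m0 ** (-(m + i) / lam_bar)
            M[k][i] = (-1) ** (k - i) * binom * m0 ** ((m + k) / lam_bar)
    return L0, M

def matmul(A, B):
    """Plain dense matrix product of two square lists of lists."""
    n = len(A)
    return [[sum(A[r][j] * B[j][c] for j in range(n)) for c in range(n)]
            for r in range(n)]

def max_deviation_from_identity(m, m0, lam_bar, size):
    """Largest entry of |M L(0) - I| for the truncated matrices."""
    L0, M = build_L0_and_M(m, m0, lam_bar, size)
    prod = matmul(M, L0)
    return max(abs(prod[r][c] - (1.0 if r == c else 0.0))
               for r in range(size) for c in range(size))
```

Since both matrices are lower triangular, truncating them at any size leaves the leading block of the product unchanged, so the check is meaningful for small `size`.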



which nicely isolates the principal dynamics (the first 2 terms)
from the initial conditions. Specifically, by imposing the usual
initial conditions, $P_{m+k}(0) = \delta_{k0}$, it is straightforward,
albeit somewhat lengthy, to obtain a closed-form expression for the
complete dynamical elements as

$$ P_{m+k}(\tau) = \bar{\beta} m_0 (m)_k \left[ \frac{1}{(m+\bar{\Lambda})_{k+1}}\, \frac{1}{X(\tau)} - \frac{X(\tau)^{m/\bar{\Lambda}}\, F_k\!\left( X(\tau)^{1/\bar{\Lambda}} \right)}{(m+\bar{\Lambda})\, \Gamma(k+1)} \right] + \frac{(m)_k}{\Gamma(k+1)}\, X(\tau)^{m/\bar{\Lambda}} \left[ 1 - X(\tau)^{1/\bar{\Lambda}} \right]^k \tag{A14} $$

with $X(\tau) = m_0/(m_0 + \bar{\Lambda}\tau)$ and where
$F_k(X) = {}_2F_1(-k, m+\bar{\Lambda}; m+\bar{\Lambda}+1; X)$ represents
a terminating hypergeometric series of degree $k$. One verifies that,
by setting $\tau = 0$ in the previous expression, one obtains
$P_{m+k}(0) = \delta_{k0}$ as it should.
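The closed-form solution can be compared against a direct numerical integration of Eq. (A5). The following Python sketch does so with a hand-written fourth-order Runge-Kutta scheme. It assumes the closed form $P_{m+k}(\tau) = \bar{\beta} m_0 (m)_k \{ 1/[(m+\bar{\Lambda})_{k+1} X] - X^{m/\bar{\Lambda}} F_k(X^{1/\bar{\Lambda}})/[(m+\bar{\Lambda})\Gamma(k+1)] \} + (m)_k X^{m/\bar{\Lambda}} (1 - X^{1/\bar{\Lambda}})^k/\Gamma(k+1)$ with $X = m_0/(m_0+\bar{\Lambda}\tau)$. All function names, the step size and the parameter values are choices made for this example.

```python
import math

def pochhammer(a, j):
    """Rising factorial (a)_j = a (a+1) ... (a+j-1)."""
    out = 1.0
    for i in range(j):
        out *= a + i
    return out

def F(k, b, x):
    """Terminating series 2F1(-k, b; b+1; x), summed term by term."""
    return sum((-1) ** n * math.comb(k, n) * b / (b + n) * x ** n
               for n in range(k + 1))

def closed_form(k, tau, beta_bar, lam_bar, m, m0):
    """P_{m+k}(tau) for the initial condition P_{m+k}(0) = delta_{k0}."""
    X = m0 / (m0 + lam_bar * tau)
    z = X ** (1.0 / lam_bar)
    b = m + lam_bar
    driven = beta_bar * m0 * pochhammer(m, k) * (
        1.0 / (pochhammer(b, k + 1) * X)
        - X ** (m / lam_bar) * F(k, b, z) / (b * math.gamma(k + 1)))
    initial = (pochhammer(m, k) / math.gamma(k + 1)
               * X ** (m / lam_bar) * (1.0 - z) ** k)
    return driven + initial

def rhs(tau, P, beta_bar, lam_bar, m, m0):
    """Right-hand side of Eq. (A5) for sizes m, m+1, ..., m+len(P)-1."""
    x = m0 + lam_bar * tau
    out = []
    for k, Pk in enumerate(P):
        inflow = beta_bar if k == 0 else (m + k - 1) / x * P[k - 1]
        out.append(inflow - (m + k) / x * Pk)
    return out

def integrate_rk4(tau_end, kmax, beta_bar, lam_bar, m, m0, h=1e-3):
    """Classical RK4 integration from tau = 0 with P_{m+k}(0) = delta_{k0}."""
    P = [1.0] + [0.0] * kmax
    args = (beta_bar, lam_bar, m, m0)
    for n in range(int(round(tau_end / h))):
        tau = n * h
        k1 = rhs(tau, P, *args)
        k2 = rhs(tau + h / 2, [p + h / 2 * d for p, d in zip(P, k1)], *args)
        k3 = rhs(tau + h / 2, [p + h / 2 * d for p, d in zip(P, k2)], *args)
        k4 = rhs(tau + h, [p + h * d for p, d in zip(P, k3)], *args)
        P = [p + h / 6 * (a + 2 * b + 2 * c + d)
             for p, a, b, c, d in zip(P, k1, k2, k3, k4)]
    return P
```

Because the equation for size $m+k$ only involves sizes up to $m+k$, truncating the system at `kmax` does not affect the lower sizes, so the comparison is exact up to the integration error.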

It can further be shown that the continuous and discrete time
versions of PA converge toward the same asymptotic behavior.



Appendix B: Scaling exponents in the peloton dynamics 



It has been seen in Fig. 8(c) that the probability distribution
$P(x; t)$ follows the scaling relation

$$ P(x) \propto x^{-\gamma} P(x/f(t);\, t \gg 1) , \tag{B1} $$

where $\gamma$ is either equal to $\gamma_m$ for elements or $\gamma_s$
for structures. This Appendix derives the growth function, $f(t)$,
describing the mean state of a single entity (e.g., its number of
occurrences or its size) at time $t$ within a system whose global
growth is governed by PA. Once again, because we follow mean
quantities, the process is deterministic.

Without loss of generality, we suppose that only one entity
is present at time $t = 1$, such that always exactly $t$ events
will have occurred by time $t$. This simplifies the normalization
of transition probability and we can thus write the effect of a
general PA step on a single entity as:

$$ f(t+1) = \left[ \beta + \alpha\, \frac{t - f(t)}{t} \right] f(t) + \alpha\, \frac{f(t)}{t} \left( f(t) + 1 \right) . \tag{B2} $$



For the node-based cases, a further simplification arises,
$\alpha + \beta = 1$, yielding a recursive rule for the growth
function $f(t)$:

$$ f(t+1) = \left( 1 + \frac{\alpha}{t} \right) f(t) \tag{B3} $$

which directly fixes the derivative in the limit of large $t$:

$$ \frac{d}{dt} f(t) = \frac{\alpha}{t}\, f(t) . \tag{B4} $$

The general solution to Eq. (B4) is:

$$ f(t) = A t^{\alpha} + B . \tag{B5} $$

For the original entity, $f(1) = 1$, which is destined to be the
leader of this deterministic process, one obtains $A = 1$ and $B = 0$
(a nonzero $B$ would not satisfy Eq. (B4)), and thus the following
mean position at time $t$:

$$ f(t) = t^{\alpha} . \tag{B6} $$

Equation (B6) dictates the evolution of the leader's position
and thus fixes the renormalization used in Fig. 8(c). Once
again, one can refer to Tab. I for the values of $\alpha$ in different
PA models.
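The growth law can be checked by iterating the discrete recursion $f(t+1) = (1 + \alpha/t) f(t)$ directly. The sketch below, with hypothetical function names, compares the iterate with $t^{\alpha}$. One caveat that is an addition of this example rather than a statement of the text: the exact product equals $\Gamma(t+\alpha)/[\Gamma(1+\alpha)\Gamma(t)]$, so the discrete recursion reproduces the exponent $\alpha$ but with a constant prefactor $1/\Gamma(1+\alpha)$ relative to the continuous solution.

```python
import math

def leader_position(alpha, t_end):
    """Iterate f(t+1) = (1 + alpha/t) f(t) from f(1) = 1 up to integer t_end >= 1."""
    f = 1.0
    for t in range(1, t_end):
        f *= 1.0 + alpha / t
    return f

def scaling_ratio(alpha, t_end):
    """Ratio f(t_end) / t_end**alpha, which tends to 1/Gamma(1 + alpha)."""
    return leader_position(alpha, t_end) / t_end ** alpha

def local_exponent(alpha, t):
    """Effective exponent measured between times t and 2t on a log-log scale."""
    return math.log(leader_position(alpha, 2 * t) / leader_position(alpha, t)) / math.log(2.0)
```

A constant prefactor does not change the collapse obtained by rescaling $x$ with $f(t)$, since only the time dependence $t^{\alpha}$ matters for the renormalization.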